Zipfian distribution (zipfian)#

The Zipfian distribution is a discrete power-law distribution over ranks. It is commonly used when a few items are very frequent (low ranks) and many items are rare (high ranks).

Throughout, we’ll use SciPy’s parameterization scipy.stats.zipfian(a, n), which is a finite-support (truncated) Zipf law on ranks \(\{1,\dots,n\}\).

Notebook roadmap#

  1. Define the distribution (PMF/CDF) and connect it to Zipf’s law.

  2. Work out moments/properties (including when moments exist in the \(n\to\infty\) limit).

  3. Derive likelihood and a practical MLE for the exponent.

  4. Implement NumPy-only sampling via inverse transform.

  5. Visualize PMF/CDF and Monte Carlo behavior.

  6. Use SciPy’s zipfian for evaluation, sampling, and fitting.

import math

import numpy as np

import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from plotly.subplots import make_subplots

from scipy import stats
from scipy.optimize import minimize_scalar

pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")

np.set_printoptions(precision=6, suppress=True)

SEED = 7
rng = np.random.default_rng(SEED)

print("numpy:", np.__version__)
import scipy
print("scipy:", scipy.__version__)
numpy: 1.26.2
scipy: 1.15.0

1) Title & Classification#

  • Name: zipfian (Zipfian / truncated Zipf law)

  • Type: Discrete

  • Support (SciPy default loc=0):

    \[\mathcal{S}=\{1,2,\dots,n\}\]

    With a location shift loc, the support becomes \(\{\text{loc}+1,\dots,\text{loc}+n\}\).

  • Parameter space (SciPy):

    \[a>0,\qquad n\in\{1,2,3,\dots\}\]

    where:

    • \(a\) is the exponent (controls tail heaviness)

    • \(n\) is the maximum rank (finite truncation)
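The loc shift noted above is easy to verify directly via SciPy’s generic loc handling; a minimal sketch (the parameter values here are arbitrary):

```python
from scipy import stats

a, n = 1.5, 5

# Default support is {1, ..., n}; loc shifts it to {loc+1, ..., loc+n}
lo, hi = stats.zipfian.support(a, n, loc=10)

# The PMF is unchanged up to the shift
p_base = stats.zipfian.pmf(3, a, n)
p_shifted = stats.zipfian.pmf(13, a, n, loc=10)
print((lo, hi), p_base, p_shifted)
```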

2) Intuition & Motivation#

What it models#

A Zipfian distribution models ranked outcomes where the probability of rank \(k\) decays approximately like a power law:

\[\Pr(X=k) \propto k^{-a}. \]

This captures “short head, long tail” behavior: rank 1 is common, rank 2 is less common, and so on, with many rare high-rank outcomes.

Typical real-world use cases#

  • Word frequencies (Zipf’s law): common words occur extremely often; most words are rare.

  • Popularity / demand: product views, song plays, page visits (ranked by popularity).

  • Discrete heavy tails with a hard cutoff: when only the top \(n\) ranks are possible (finite catalog / vocabulary).

Relations to other distributions#

  • scipy.stats.zipf: the infinite-support Zipf / zeta distribution on \(\{1,2,\dots\}\).

  • Pareto (continuous analogue): power law on \([x_{\min},\infty)\).

  • Truncation: zipfian is essentially a truncated power law; as \(n\to\infty\) and \(a>1\), it approaches the zeta distribution.
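The last relation can be checked numerically: for \(a>1\), the zipfian PMF converges to the zeta PMF as \(n\) grows. A quick sketch (the values of `a` and `n` are arbitrary):

```python
import numpy as np
from scipy import stats

a = 2.0
k = np.arange(1, 11)

p_zeta = stats.zipf.pmf(k, a)                    # infinite-support zeta law
p_trunc_small = stats.zipfian.pmf(k, a, 50)      # truncated at n = 50
p_trunc_large = stats.zipfian.pmf(k, a, 10_000)  # truncated far out in the tail

# The truncated PMF approaches the zeta PMF as n grows
err_small = np.max(np.abs(p_trunc_small - p_zeta))
err_large = np.max(np.abs(p_trunc_large - p_zeta))
print(err_small, err_large)
```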

3) Formal Definition#

Define the generalized harmonic number

\[H_{n,a}=\sum_{j=1}^{n} j^{-a}. \]

PMF#

For \(k\in\{1,\dots,n\}\),

\[\Pr(X=k\mid a,n) = \frac{k^{-a}}{H_{n,a}}. \]

(And \(\Pr(X=k)=0\) outside the support.)

CDF#

For real \(x\),

\[\begin{split}F(x)=\Pr(X\le x)=\begin{cases} 0, & x<1\\ \frac{1}{H_{n,a}}\sum_{j=1}^{\lfloor x\rfloor} j^{-a}, & 1\le x<n\\ 1, & x\ge n. \end{cases}\end{split}\]

Because the distribution is discrete, the CDF is a step function.

def _validate_a_n(a, n):
    a = float(a)
    if not np.isfinite(a) or a <= 0.0:
        raise ValueError("a must be a finite float > 0")

    if isinstance(n, bool) or not isinstance(n, (int, np.integer)):
        raise TypeError("n must be an integer")
    n_int = int(n)
    if n_int < 1:
        raise ValueError("n must be >= 1")

    return a, n_int


def _logsumexp_np(log_w: np.ndarray) -> float:
    # NumPy-only stable log-sum-exp for 1D arrays.
    log_w = np.asarray(log_w, dtype=float)
    m = np.max(log_w)
    if not np.isfinite(m):
        return float(m)
    return float(m + np.log(np.sum(np.exp(log_w - m))))


def zipfian_logZ(a: float, n: int) -> float:
    # log H_{n,a} computed stably.
    a, n = _validate_a_n(a, n)
    ks = np.arange(1, n + 1, dtype=float)
    return _logsumexp_np(-a * np.log(ks))


def zipfian_logpmf(k, a: float, n: int):
    # Log PMF for the Zipfian distribution on {1,...,n}.
    a, n = _validate_a_n(a, n)

    k_arr = np.asarray(k)
    out = np.full(k_arr.shape, -np.inf, dtype=float)

    k_int = k_arr.astype(int)
    valid = (k_int == k_arr) & (k_int >= 1) & (k_int <= n)
    if not np.any(valid):
        return out

    logZ = zipfian_logZ(a, n)
    out[valid] = -a * np.log(k_int[valid]) - logZ
    return out


def zipfian_pmf(k, a: float, n: int):
    return np.exp(zipfian_logpmf(k, a, n))


def zipfian_pmf_array(a: float, n: int):
    # Return support ks and PMF values as arrays.
    a, n = _validate_a_n(a, n)

    ks = np.arange(1, n + 1)
    logZ = zipfian_logZ(a, n)
    pmf = np.exp(-a * np.log(ks) - logZ)
    pmf = pmf / pmf.sum()  # guard against floating drift
    return ks, pmf


def zipfian_cdf(x, a: float, n: int):
    # CDF as a step function using the PMF array.
    a, n = _validate_a_n(a, n)

    x_arr = np.asarray(x)
    ks, pmf = zipfian_pmf_array(a, n)
    cdf_vals = np.cumsum(pmf)
    cdf_vals = np.clip(cdf_vals, 0.0, 1.0)

    k = np.floor(x_arr).astype(int)
    out = np.zeros_like(x_arr, dtype=float)

    out[x_arr >= n] = 1.0

    inside = (x_arr >= 1) & (x_arr < n)
    if np.any(inside):
        out[inside] = cdf_vals[k[inside] - 1]

    return out


# Quick sanity check
a, n = 1.3, 20
ks, pmf = zipfian_pmf_array(a, n)
print("sum pmf:", pmf.sum())
print("cdf(n):", zipfian_cdf(n, a, n))
sum pmf: 0.9999999999999998
cdf(n): 1.0

4) Moments & Properties#

A convenient way to express moments uses the generalized harmonic numbers.

Raw moments#

For integer \(m\ge 0\),

\[\mathbb{E}[X^m]=\sum_{k=1}^{n} k^m\,\frac{k^{-a}}{H_{n,a}} = \frac{\sum_{k=1}^{n} k^{m-a}}{H_{n,a}} = \frac{H_{n,a-m}}{H_{n,a}}. \]

In particular:

\[\mathbb{E}[X]=\frac{H_{n,a-1}}{H_{n,a}},\qquad \mathbb{E}[X^2]=\frac{H_{n,a-2}}{H_{n,a}}. \]

Mean and variance#

\[\mathrm{Var}(X)=\mathbb{E}[X^2]-\mathbb{E}[X]^2. \]

Because zipfian has finite support, all moments exist for any \(a>0\). However, in the limit \(n\to\infty\) (the zeta distribution), normalization requires \(a>1\) and higher moments require progressively larger \(a\).
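The \(n\to\infty\) caveat can be seen numerically. With \(a=1.5\) (below the \(a>2\) threshold needed for a finite mean in the zeta limit), the truncated mean keeps growing with \(n\); with \(a=3\) it stabilizes. A self-contained sketch:

```python
import numpy as np

def truncated_zipf_mean(a, n):
    # E[X] = H_{n, a-1} / H_{n, a} for the Zipfian on {1, ..., n}
    ks = np.arange(1, n + 1, dtype=float)
    return np.sum(ks ** (1.0 - a)) / np.sum(ks ** (-a))

# a = 1.5: H_{n, 0.5} diverges like 2*sqrt(n), so the mean grows without bound
for n in (100, 10_000, 1_000_000):
    print(n, truncated_zipf_mean(1.5, n))

# a = 3.0: both harmonic sums converge, so the mean stabilizes
for n in (100, 10_000, 1_000_000):
    print(n, truncated_zipf_mean(3.0, n))
```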

MGF and characteristic function#

With finite support,

\[M_X(t)=\mathbb{E}[e^{tX}]=\sum_{k=1}^{n} e^{tk}\,\frac{k^{-a}}{H_{n,a}},\]
\[\varphi_X(t)=\mathbb{E}[e^{itX}]=\sum_{k=1}^{n} e^{itk}\,\frac{k^{-a}}{H_{n,a}}. \]

Entropy#

\[\mathcal{H}(X)=-\sum_{k=1}^{n} p_k\log p_k. \]

Using \(\log p_k=-a\log k-\log H_{n,a}\), we can rewrite

\[\mathcal{H}(X)=\log H_{n,a} + a\,\mathbb{E}[\log X]. \]

def zipfian_raw_moment(m: int, a: float, n: int) -> float:
    a, n = _validate_a_n(a, n)

    ks = np.arange(1, n + 1, dtype=float)
    logk = np.log(ks)
    logZ = zipfian_logZ(a, n)

    # E[X^m] = sum exp((m-a) log k - logZ)
    return float(np.exp(_logsumexp_np((m - a) * logk) - logZ))


def zipfian_moments(a: float, n: int):
    a, n = _validate_a_n(a, n)

    m1 = zipfian_raw_moment(1, a, n)
    m2 = zipfian_raw_moment(2, a, n)
    m3 = zipfian_raw_moment(3, a, n)
    m4 = zipfian_raw_moment(4, a, n)

    var = m2 - m1**2

    if var <= 0.0:
        skew = float("nan")
        excess_kurt = float("nan")
    else:
        mu3 = m3 - 3.0 * m1 * m2 + 2.0 * m1**3
        mu4 = m4 - 4.0 * m1 * m3 + 6.0 * (m1**2) * m2 - 3.0 * m1**4

        skew = mu3 / (var ** 1.5)
        excess_kurt = mu4 / (var**2) - 3.0

    return {
        "mean": m1,
        "var": var,
        "skew": skew,
        "kurtosis": excess_kurt + 3.0,
        "excess_kurt": excess_kurt,
    }


def zipfian_expected_log(a: float, n: int) -> float:
    ks, pmf = zipfian_pmf_array(a, n)
    return float(np.sum(pmf * np.log(ks)))


def zipfian_entropy(a: float, n: int, *, base=math.e) -> float:
    ks, pmf = zipfian_pmf_array(a, n)
    mask = pmf > 0
    H_nats = -np.sum(pmf[mask] * np.log(pmf[mask]))

    if base == math.e:
        return float(H_nats)
    return float(H_nats / math.log(base))


def zipfian_log_mgf(t: float, a: float, n: int) -> float:
    a, n = _validate_a_n(a, n)

    ks = np.arange(1, n + 1, dtype=float)
    logZ = zipfian_logZ(a, n)
    logp = -a * np.log(ks) - logZ

    return _logsumexp_np(logp + t * ks)


def zipfian_mgf(t: float, a: float, n: int) -> float:
    return float(np.exp(zipfian_log_mgf(t, a, n)))


def zipfian_cf(t: float, a: float, n: int) -> complex:
    a, n = _validate_a_n(a, n)

    ks = np.arange(1, n + 1, dtype=float)
    logZ = zipfian_logZ(a, n)
    logp = -a * np.log(ks) - logZ

    return np.sum(np.exp(logp) * np.exp(1j * t * ks))


a, n = 1.2, 50
mom = zipfian_moments(a, n)
H_direct = zipfian_entropy(a, n)
H_via_formula = zipfian_logZ(a, n) + a * zipfian_expected_log(a, n)

{
    **mom,
    "entropy_nats": H_direct,
    "entropy_nats_check": float(H_via_formula),
    "mgf(t=0.1)": zipfian_mgf(0.1, a, n),
    "cf(t=1.0)": zipfian_cf(1.0, a, n),
}
{'mean': 8.483213566153296,
 'var': 123.37129111277592,
 'skew': 1.891549252553807,
 'kurtosis': 5.885075724224906,
 'excess_kurt': 2.8850757242249063,
 'entropy_nats': 2.85361543255586,
 'entropy_nats_check': 2.85361543255586,
 'mgf(t=0.1)': 6.882518427186466,
 'cf(t=1.0)': (0.03586759328104913+0.31936532991899086j)}

5) Parameter Interpretation#

  • Exponent \(a\) (tail heaviness)

    • Smaller \(a\) (closer to 0) makes the distribution flatter (closer to uniform on \(\{1,\dots,n\}\)).

    • Larger \(a\) concentrates mass on small ranks; the tail decays faster.

  • Truncation \(n\) (maximum rank)

    • Increasing \(n\) extends the tail to allow rarer outcomes.

    • For fixed \(a\), increasing \(n\) increases the normalizer \(H_{n,a}\), so the probabilities of low ranks decrease slightly.

A common diagnostic is a log–log plot of PMF vs rank: power laws appear approximately linear with slope \(-a\) (up to truncation effects).
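Because \(\log p_k = -a\log k - \log H_{n,a}\) exactly, a least-squares line through the theoretical log–log PMF recovers the slope \(-a\). A small sketch (using `np.polyfit` is an illustrative choice, not part of this notebook’s helpers):

```python
import numpy as np

a, n = 1.6, 80
ks = np.arange(1, n + 1, dtype=float)
pmf = ks ** (-a)
pmf /= pmf.sum()

# Fit log p_k = slope * log k + intercept; slope should be -a
slope, intercept = np.polyfit(np.log(ks), np.log(pmf), deg=1)
print("fitted slope:", slope)
```

On empirical frequencies the line is only approximately straight: sampling noise in the tail and finite-sample effects bend it.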

n_fixed = 80
a_values = [0.6, 1.0, 1.6, 2.5]

n_values = [20, 80, 300]
a_fixed = 1.2

fig = make_subplots(
    rows=1,
    cols=2,
    subplot_titles=(
        f"PMF vs a (n={n_fixed})",
        f"PMF vs n (a={a_fixed})",
    ),
)

# Left: vary a
ks = np.arange(1, n_fixed + 1)
for a_ in a_values:
    fig.add_trace(
        go.Scatter(
            x=ks,
            y=zipfian_pmf(ks, a_, n_fixed),
            mode="markers+lines",
            name=f"a={a_}",
        ),
        row=1,
        col=1,
    )

# Right: vary n
for n_ in n_values:
    ks_ = np.arange(1, n_ + 1)
    fig.add_trace(
        go.Scatter(
            x=ks_,
            y=zipfian_pmf(ks_, a_fixed, n_),
            mode="markers+lines",
            name=f"n={n_}",
        ),
        row=1,
        col=2,
    )

fig.update_xaxes(title_text="rank k", type="log", row=1, col=1)
fig.update_yaxes(title_text="P(X=k)", type="log", row=1, col=1)

fig.update_xaxes(title_text="rank k", type="log", row=1, col=2)
fig.update_yaxes(title_text="P(X=k)", type="log", row=1, col=2)

fig.update_layout(title="Zipfian PMF: shape changes", legend_title_text="parameter")
fig.show()

6) Derivations#

Expectation#

\begin{aligned} \mathbb{E}[X] &= \sum_{k=1}^{n} k\,\Pr(X=k) = \sum_{k=1}^{n} k\,\frac{k^{-a}}{H_{n,a}}\\ &= \frac{\sum_{k=1}^{n} k^{1-a}}{H_{n,a}} = \frac{H_{n,a-1}}{H_{n,a}}. \end{aligned}

Variance#

\begin{aligned} \mathrm{Var}(X) &= \mathbb{E}[X^2]-\mathbb{E}[X]^2 = \frac{H_{n,a-2}}{H_{n,a}} - \left(\frac{H_{n,a-1}}{H_{n,a}}\right)^2. \end{aligned}

Likelihood (i.i.d. data)#

Let \(x_1,\dots,x_m\) be i.i.d. with \(x_i\in\{1,\dots,n\}\). The log-likelihood for \(a\) (with \(n\) treated as known) is

\begin{aligned} \ell(a) &= \sum_{i=1}^{m}\log\Pr(X=x_i\mid a,n)\\ &= \sum_{i=1}^{m}\left(-a\log x_i-\log H_{n,a}\right)\\ &= -a\sum_{i=1}^{m}\log x_i - m\log H_{n,a}. \end{aligned}

If \(n\) is unknown but the support is truncated, we must have \(n\ge\max_i x_i\). For any fixed \(a>0\), \(H_{n,a}\) is strictly increasing in \(n\), so the log-likelihood is strictly decreasing in \(n\). Hence the MLE for \(n\) is the boundary estimate:

\[\hat n = \max_i x_i.\]

(Practically, \(n\) is usually known from context: vocabulary size, catalog size, maximum rank considered, etc.)
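The boundary argument can be checked numerically: for fixed \(a\) and data, the log-likelihood only loses as \(n\) grows past \(\max_i x_i\). A self-contained sketch (the value of `a` and the toy data are illustrative):

```python
import numpy as np

def loglik(a, x, n):
    # ell(a) = -a * sum(log x_i) - m * log H_{n, a}
    ks = np.arange(1, n + 1, dtype=float)
    logH = np.log(np.sum(ks ** (-a)))
    return -a * np.sum(np.log(x)) - x.size * logH

rng = np.random.default_rng(0)
x = rng.integers(1, 31, size=500)  # toy integer ranks with max <= 30

a = 1.2
for n in (int(x.max()), 100, 500):
    print(n, loglik(a, x, n))  # strictly decreasing in n
```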

def _validate_sample(data, n: int) -> np.ndarray:
    if isinstance(n, bool) or not isinstance(n, (int, np.integer)):
        raise TypeError("n must be an integer")
    n = int(n)

    x = np.asarray(data)
    if x.ndim != 1:
        x = x.ravel()

    if np.issubdtype(x.dtype, np.integer):
        x_int = x.astype(int)
    else:
        if np.any(np.floor(x) != x):
            raise ValueError("data must be integer-valued")
        x_int = x.astype(int)

    if np.any(x_int < 1) or np.any(x_int > n):
        raise ValueError(f"all observations must be in {{1,...,{n}}}")

    return x_int


def zipfian_loglik(a: float, data: np.ndarray, n: int) -> float:
    a, n = _validate_a_n(a, n)
    x = _validate_sample(data, n)

    m = x.size
    return float(-a * np.sum(np.log(x)) - m * zipfian_logZ(a, n))


def zipfian_mle_a(data: np.ndarray, n: int, *, bounds=(1e-3, 10.0)):
    # MLE for a with fixed n via 1D optimization.
    _, n = _validate_a_n(1.0, n)  # validate n
    x = _validate_sample(data, n)

    def nll(a):
        if a <= 0:
            return np.inf
        return -zipfian_loglik(a, x, n)

    res = minimize_scalar(nll, bounds=bounds, method="bounded")
    return float(res.x), res


# Demo: generate data and recover a
true_a, true_n = 1.35, 60
x = stats.zipfian.rvs(true_a, true_n, size=20_000, random_state=rng)

# If n is unknown, a common MLE choice is n_hat = max(x)
n_hat = int(x.max())
a_hat, opt_res = zipfian_mle_a(x, n_hat)

{
    "true_a": true_a,
    "true_n": true_n,
    "n_hat": n_hat,
    "a_hat": a_hat,
    "optimizer_success": opt_res.success,
}
{'true_a': 1.35,
 'true_n': 60,
 'n_hat': 60,
 'a_hat': 1.3436470243677288,
 'optimizer_success': True}

7) Sampling & Simulation (NumPy-only)#

Because the support is finite, the most direct NumPy-only sampler uses inverse transform sampling:

  1. Compute the PMF \(p_k\) on \(k=1,\dots,n\).

  2. Compute the CDF \(c_k = \sum_{j\le k} p_j\).

  3. Draw \(U\sim\mathrm{Uniform}(0,1)\).

  4. Return the smallest \(k\) such that \(c_k\ge U\).

This is:

  • \(\mathcal{O}(n)\) to build the CDF (once per parameter setting)

  • \(\mathcal{O}(\log n)\) per sample using np.searchsorted

For very large \(n\) with repeated sampling, an alias method can reduce per-sample cost to \(\mathcal{O}(1)\) after an \(\mathcal{O}(n)\) preprocessing step.
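A minimal sketch of that alias approach (Walker/Vose tables: \(\mathcal{O}(n)\) setup, \(\mathcal{O}(1)\) per draw); this is an alternative to `searchsorted`, not what the sampler below uses:

```python
import numpy as np

def alias_setup(pmf):
    # O(n) preprocessing: split each scaled probability into a "kept" part
    # and an alias pointing at an over-full entry (Walker / Vose method).
    n = len(pmf)
    prob = np.empty(n)
    alias = np.zeros(n, dtype=int)
    scaled = np.asarray(pmf, dtype=float) * n
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l
        scaled[l] -= 1.0 - scaled[s]
        (small if scaled[l] < 1.0 else large).append(l)
    for leftover in small + large:
        prob[leftover] = 1.0
    return prob, alias

def alias_sample(prob, alias, size, rng):
    # O(1) per draw: pick a column uniformly, then flip a biased coin.
    cols = rng.integers(0, len(prob), size=size)
    coin = rng.random(size)
    return np.where(coin < prob[cols], cols, alias[cols]) + 1  # ranks 1..n

# Usage with a Zipfian PMF on {1, ..., n}
a, n = 1.2, 1_000
ks = np.arange(1, n + 1, dtype=float)
pmf = ks ** (-a)
pmf /= pmf.sum()
prob, alias = alias_setup(pmf)
rng = np.random.default_rng(0)
draws = alias_sample(prob, alias, 50_000, rng)
```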

def sample_zipfian_numpy(a: float, n: int, size, rng: np.random.Generator) -> np.ndarray:
    a, n = _validate_a_n(a, n)

    if isinstance(size, tuple):
        out_shape = size
        size_int = int(np.prod(size))
    else:
        size_int = int(size)
        out_shape = (size_int,)

    if size_int < 0:
        raise ValueError("size must be non-negative")

    ks, pmf = zipfian_pmf_array(a, n)
    cdf = np.cumsum(pmf)
    cdf[-1] = 1.0  # guard against floating drift

    u = rng.random(size_int)
    samples = ks[np.searchsorted(cdf, u, side="left")]
    return samples.reshape(out_shape)


a, n = 1.2, 80
samples = sample_zipfian_numpy(a, n, size=100_000, rng=rng)

mom = zipfian_moments(a, n)

{
    "theory_mean": mom["mean"],
    "mc_mean": float(samples.mean()),
    "theory_var": mom["var"],
    "mc_var": float(samples.var(ddof=0)),
}
{'theory_mean': 11.700596007114799,
 'mc_mean': 11.6452,
 'theory_var': 289.1532440210312,
 'mc_var': 286.57043695999994}

8) Visualization#

We’ll visualize:

  • the PMF (often most informative on a log–log scale)

  • the CDF (step function)

  • Monte Carlo samples: empirical frequencies vs theoretical PMF

a, n = 1.25, 100
ks, pmf = zipfian_pmf_array(a, n)

# CDF values on the support
cdf = np.cumsum(pmf)

# Monte Carlo frequencies
m = 60_000
samps = sample_zipfian_numpy(a, n, size=m, rng=rng)
counts = np.bincount(samps, minlength=n + 1)[1:]
emp_pmf = counts / counts.sum()

fig = make_subplots(
    rows=1,
    cols=3,
    subplot_titles=(
        "PMF (log–log)",
        "CDF",
        "Empirical vs theoretical PMF",
    ),
)

# PMF
fig.add_trace(
    go.Scatter(x=ks, y=pmf, mode="markers+lines", name="theory"),
    row=1,
    col=1,
)
fig.update_xaxes(type="log", title_text="rank k", row=1, col=1)
fig.update_yaxes(type="log", title_text="P(X=k)", row=1, col=1)

# CDF
fig.add_trace(
    go.Scatter(x=ks, y=cdf, mode="lines", name="CDF"),
    row=1,
    col=2,
)
fig.update_xaxes(title_text="k", row=1, col=2)
fig.update_yaxes(title_text="P(X≤k)", row=1, col=2)

# Empirical vs theoretical
fig.add_trace(
    go.Scatter(
        x=ks,
        y=emp_pmf,
        mode="markers",
        name="empirical",
        opacity=0.6,
    ),
    row=1,
    col=3,
)
fig.add_trace(
    go.Scatter(
        x=ks,
        y=pmf,
        mode="lines",
        name="theory",
    ),
    row=1,
    col=3,
)
fig.update_xaxes(type="log", title_text="rank k", row=1, col=3)
fig.update_yaxes(type="log", title_text="probability", row=1, col=3)

fig.update_layout(title=f"Zipfian(a={a}, n={n})")
fig.show()

9) SciPy Integration#

SciPy provides the Zipfian distribution as scipy.stats.zipfian.

Common methods:

  • PMF / CDF: stats.zipfian.pmf, stats.zipfian.cdf

  • Sampling: stats.zipfian.rvs

  • Fitting: scipy.stats.fit(stats.zipfian, data, ...)

Notes:

  • zipfian(a, n) is finite-support on \(\{1,\dots,n\}\).

  • zipf(a) is the infinite-support zeta distribution (requires \(a>1\)).

a, n = 1.3, 50

dist = stats.zipfian(a, n)  # frozen distribution
print("support:", dist.support())
print("pmf([1,2,10]):", dist.pmf([1, 2, 10]))
print("cdf([1,2,10]):", dist.cdf([1, 2, 10]))

rvs = dist.rvs(size=10, random_state=rng)
print("rvs:", rvs)

# Moments from SciPy
mean_s, var_s, skew_s, kurt_s = stats.zipfian.stats(a, n, moments="mvsk")
print("SciPy mean/var/skew/kurtosis:", mean_s, var_s, skew_s, kurt_s)

# Fit (MLE by default)
data = stats.zipfian.rvs(1.25, 80, size=15_000, random_state=rng)
fit_res = stats.fit(stats.zipfian, data, bounds={"a": (0.05, 10.0), "n": (2, 500)})
print(fit_res)
print("a_hat, n_hat:", fit_res.params.a, fit_res.params.n)
support: (1, 50)
pmf([1,2,10]): [0.344329 0.139841 0.017257]
cdf([1,2,10]): [0.344329 0.48417  0.787082]
rvs: [44  7 23  1  1  9  1  2  5  1]
SciPy mean/var/skew/kurtosis: 7.347698178395121 105.22290111165175 2.1749257170027665 4.278884110189842
  params: FitParams(a=1.2397758165626769, n=80.0, loc=0.0)
 success: True
 message: 'Optimization terminated successfully.'
a_hat, n_hat: 1.2397758165626769 80.0

10) Statistical Use Cases#

A) Hypothesis testing (goodness-of-fit)#

A common question: does a Zipfian model explain these ranks?

A simple approach:

  1. Fit \(a\) (and possibly \(n\)).

  2. Compare observed counts to expected counts under the fitted model.

  3. Use a chi-square test (or a likelihood-ratio / G-test).

Because parameters are estimated from the same data, classical p-values can be optimistic; a more careful approach uses a parametric bootstrap.

B) Bayesian modeling#

Treat \(a\) as unknown with a prior, e.g. \(a\sim\mathrm{Gamma}(\alpha_0,\beta_0)\). Then

\[p(a\mid x) \propto p(x\mid a)\,p(a).\]

For 1D \(a\), a grid posterior is straightforward.

C) Generative modeling#

Zipfian sampling is a standard building block for synthetic datasets with:

  • head–tail imbalance

  • realistic rank-frequency behavior

Examples include generating synthetic word IDs, item popularity, or request patterns.

# A) Chi-square goodness-of-fit demo
true_a, true_n = 1.15, 25
x = stats.zipfian.rvs(true_a, true_n, size=30_000, random_state=rng)

# Fit a with n fixed (here: assume n known)
a_hat, _ = zipfian_mle_a(x, true_n, bounds=(0.05, 5.0))

obs = np.bincount(x, minlength=true_n + 1)[1:]
ks, pmf_hat = zipfian_pmf_array(a_hat, true_n)
exp = x.size * pmf_hat

chi2_stat, p_value = stats.chisquare(f_obs=obs, f_exp=exp)
{
    "true_a": true_a,
    "a_hat": a_hat,
    "chi2": float(chi2_stat),
    "p_value_naive": float(p_value),
}
{'true_a': 1.15,
 'a_hat': 1.151112295598441,
 'chi2': 23.181215720032025,
 'p_value_naive': 0.5091224145363746}
# B) Bayesian modeling of a via a grid posterior (n fixed)

x = stats.zipfian.rvs(1.3, 60, size=8_000, random_state=rng)
n = 60

# Prior: Gamma(alpha0, rate=beta0)
alpha0, beta0 = 2.0, 1.0

a_grid = np.linspace(0.2, 4.0, 500)
log_prior = stats.gamma.logpdf(a_grid, a=alpha0, scale=1.0 / beta0)
log_like = np.array([zipfian_loglik(a, x, n) for a in a_grid])
log_post_unnorm = log_prior + log_like

# Normalize on the grid
logZ_post = _logsumexp_np(log_post_unnorm)
post = np.exp(log_post_unnorm - logZ_post)
post = post / post.sum()

post_mean = float(np.sum(a_grid * post))
post_cdf = np.cumsum(post)
ci_low = float(a_grid[np.searchsorted(post_cdf, 0.05)])
ci_high = float(a_grid[np.searchsorted(post_cdf, 0.95)])

fig = go.Figure(
    data=[go.Scatter(x=a_grid, y=post, mode="lines", name="posterior")],
    layout=dict(
        title=f"Posterior over a (n={n}, Gamma({alpha0},{beta0}) prior)",
        xaxis_title="a",
        yaxis_title="density (grid-normalized)",
    ),
)
fig.add_vline(x=post_mean, line_dash="dash", line_color="black")
fig.add_vrect(x0=ci_low, x1=ci_high, fillcolor="gray", opacity=0.2, line_width=0)
fig.show()

{"posterior_mean": post_mean, "90%_credible_interval": (ci_low, ci_high)}
{'posterior_mean': 1.295206090499907,
 '90%_credible_interval': (1.2813627254509017, 1.311823647294589)}
# C) Generative modeling: synthetic rank-frequency data

a, n_vocab = 1.1, 2_000
n_tokens = 120_000
ranks = sample_zipfian_numpy(a, n_vocab, size=n_tokens, rng=rng)

counts = np.bincount(ranks, minlength=n_vocab + 1)[1:]
freq = counts / counts.sum()
rank = np.arange(1, n_vocab + 1)

# Theoretical PMF for comparison
_, pmf_theory = zipfian_pmf_array(a, n_vocab)

fig = go.Figure()
fig.add_trace(go.Scatter(x=rank, y=freq, mode="markers", name="empirical", opacity=0.6))
fig.add_trace(go.Scatter(x=rank, y=pmf_theory, mode="lines", name="theory"))

fig.update_xaxes(type="log", title_text="rank")
fig.update_yaxes(type="log", title_text="frequency")
fig.update_layout(title="Synthetic rank-frequency (log–log)")
fig.show()

11) Pitfalls#

  • Invalid parameters

    • SciPy’s zipfian expects a > 0 and integer n >= 1.

    • If you treat \(n\) as unknown from i.i.d. data, the likelihood typically pushes to \(\hat n = \max x_i\) (a boundary estimate). In practice, choose \(n\) from domain knowledge.

  • Numerical issues

    • For large \(n\) or extreme \(a\), naive computations of \(k^{-a}\) can underflow/overflow. Using log-space (as in zipfian_logpmf) is more stable.

    • Building the full PMF/CDF is \(\mathcal{O}(n)\) memory/time. For very large \(n\) and repeated sampling, consider alias sampling or SciPy’s implementation.

  • Model mismatch

    • Many empirical “Zipf-like” datasets deviate in the head or tail (mixtures, cutoffs, measurement bias). Always inspect residuals / goodness-of-fit.

12) Summary#

  • zipfian(a, n) is a discrete truncated power law on ranks \(\{1,\dots,n\}\) with PMF \(p_k \propto k^{-a}\).

  • The normalizer is the generalized harmonic number \(H_{n,a}\).

  • Finite support implies all moments exist, with compact expressions via harmonic numbers.

  • Inverse-CDF sampling gives a simple NumPy-only sampler.

  • SciPy’s scipy.stats.zipfian provides PMF/CDF/RVS and MLE-style fitting via scipy.stats.fit.